
Adding the StringEncoder transformer #1159

Merged: 63 commits, Jan 27, 2025
Conversation

rcap107
Contributor

@rcap107 rcap107 commented Nov 26, 2024

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.

Things left to do:

  • Testing
  • Parameter checking?
  • Default value for the PCA?
  • Docstrings
  • Deciding name of the features

@rcap107
Contributor Author

rcap107 commented Dec 5, 2024

Tests fail on the minimum-requirements build because I am using PCA rather than TruncatedSVD for the decomposition, which raises issues with potentially sparse matrices.

@jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the scikit-learn version.

Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer?

(writing this down so I don't forget)
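For reference, the setup discussed here can be sketched as below; the analyzer and n-gram range are illustrative choices, not necessarily the ones used in the PR.

```python
# Sketch of the discussed approach: character n-gram tf-idf followed by
# TruncatedSVD, which (unlike PCA) works directly on sparse matrices.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

strings = ["London", "Londres", "Paris", "Paris 15e", "Berlin"]

# tf-idf over character n-grams produces a sparse matrix
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X_sparse = vectorizer.fit_transform(strings)

# TruncatedSVD accepts the sparse input directly, so no conversion
# to a dense array (and no scikit-learn version check) is needed
svd = TruncatedSVD(n_components=2)
embeddings = svd.fit_transform(X_sparse)
print(embeddings.shape)  # (5, 2)
```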

@GaelVaroquaux
Member

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

@rcap107
Contributor Author

rcap107 commented Dec 9, 2024

> I'm very happy to see this progressing.
>
> Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

Where can I find the benchmarks?

@GaelVaroquaux
Member

Actually, let's keep it simple and use the CARTE datasets; they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one with either the highest cardinality, or the most "diverse" entries in the sense of https://arxiv.org/abs/2312.09634).

@Vincent-Maladiere
Member

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a small benchmark on the toxicity dataset.

@rcap107
Contributor Author

rcap107 commented Dec 9, 2024

> Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a small benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.

[image: benchmark plot on the toxicity dataset]

@Vincent-Maladiere
Member

That's very interesting!

@rcap107
Contributor Author

rcap107 commented Dec 17, 2024

> we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (something short and clear :D) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned. This is something for a separate PR though.

I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.

I updated the doc page on the encoders, but only to add the StringEncoder and a short summary of the different methods. Looking at the page, I think it would be better to expand it with more detail on all encoders, and maybe an explanation of the parameters, but that would take much more effort (and is definitely something for a separate PR).

@rcap107
Contributor Author

rcap107 commented Dec 17, 2024

> Nice! So what is the conclusion regarding StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

[image: encoder comparison plot]

My feeling is that OrdinalEncoder is just not that good when there is no inherent order in the feature to begin with, while strings that are similar to each other are usually related, no matter how they are sliced.

I think an interesting experiment would be a dictionary replacement where every string in the starting table is replaced by a random alphanumeric string, and then checking the performance of the encoders on that. In that case, I imagine StringEncoder would not do so well compared to OrdinalEncoder.
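That experiment could be sketched as follows (the function name and token length are made up for illustration): each unique value is mapped to a fixed random token, preserving category identity while destroying substring similarity.

```python
import random
import string

def scramble_categories(values, length=8, seed=0):
    """Map each distinct value to a fixed random alphanumeric token."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    mapping = {
        v: "".join(rng.choice(alphabet) for _ in range(length))
        for v in dict.fromkeys(values)  # unique values, order preserved
    }
    return [mapping[v] for v in values]

cities = ["Paris", "Paris 15e", "London", "Paris"]
scrambled = scramble_categories(cities)
# identical inputs still map to identical tokens, but "Paris" and
# "Paris 15e" no longer share substrings by construction
assert scrambled[0] == scrambled[3]
```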

@rcap107 rcap107 self-assigned this Dec 22, 2024
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey @rcap107! Here are a bunch of questions and nitpicks :)

Comment on lines +140 to +155
if (min_shape := min(X_out.shape)) >= self.n_components:
self.tsvd_ = TruncatedSVD(n_components=self.n_components)
result = self.tsvd_.fit_transform(X_out)
else:
warnings.warn(
f"The matrix shape is {X_out.shape}, and its minimum is "
f"{min_shape}, which is too small to fit a truncated SVD with "
f"n_components={self.n_components}. "
"The embeddings will be truncated by keeping the first "
f"{self.n_components} dimensions instead. "
)
# self.n_components can be greater than the number
# of dimensions of result.
# Therefore, self.n_components_ below stores the resulting
# number of dimensions of result.
result = X_out[:, : self.n_components].toarray()
Member

Maybe L140 to L155 could be brought into a common utility shared with the text encoder, WDYT?

Contributor Author

Sure, not in this PR though.
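For what it's worth, such a shared helper might look like this; the function name is hypothetical, and the logic follows the snippet quoted above.

```python
import warnings

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

def reduce_dimensions(X, n_components):
    """Reduce X with TruncatedSVD, or truncate columns if X is too small.

    Returns (result, fitted_svd), where fitted_svd is None when plain
    column truncation was used instead of an SVD.
    """
    if min(X.shape) >= n_components:
        tsvd = TruncatedSVD(n_components=n_components)
        return tsvd.fit_transform(X), tsvd
    warnings.warn(
        f"The matrix shape is {X.shape}, which is too small to fit a "
        f"truncated SVD with n_components={n_components}. The embeddings "
        "will be truncated by keeping the first dimensions instead."
    )
    result = X[:, :n_components]
    return result.toarray() if sparse.issparse(result) else result, None
```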

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

OpenML downloads still fail and break the CI

Member

@jeromedockes jeromedockes left a comment

LGTM!! Thanks a lot @rcap107 this is a great addition 🚀

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Looks good! Thanks @rcap107 :)

@jeromedockes jeromedockes merged commit d905cd1 into skrub-data:main Jan 27, 2025
25 checks passed
@GaelVaroquaux
Member

GaelVaroquaux commented Jan 27, 2025

I'm doing a last review of this PR. It's great, I love it. A tiny comment: I think I'd prefer the comparison plot on a log scale; if people agree, I'll implement the change (I'll do something better than what's shown below).
[image: comparison plot with log-scale axis]
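Purely illustrative (made-up points, not the PR's benchmark data): switching the time axis of a matplotlib scatter plot to a log scale amounts to the following.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# made-up (fit time, score) points standing in for the encoders
ax.scatter([0.1, 1.0, 10.0], [0.60, 0.70, 0.72])
ax.set_xscale("log")  # log scale spreads out the fast encoders
ax.set_xlabel("fit time (s)")
ax.set_ylabel("prediction score")
fig.savefig("comparison_log.png")
```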

@jeromedockes
Member

fine by me!

@Vincent-Maladiere
Member

+1
